Strong mitigation: nesting search for good policies within search for good reward

Authors

Jeshua Bratman

Satinder P. Singh

Jonathan Sorg

Richard L. Lewis

Abstract

Recent work has defined an optimal reward problem (ORP) in which an agent designer, with an objective reward function that evaluates an agent’s behavior, has a choice of what reward function to build into a learning or planning agent to guide its behavior. Existing results on ORP show weak mitigation of limited computational resources, i.e., the existence of reward functions so that agents when guided by them do better than when guided by the objective reward function. These existing results ignore the cost of finding such good reward functions. We define a nested optimal reward and control architecture that achieves strong mitigation of limited computational resources. We show empirically that the designer is better off using the new architecture that spends some of its limited resources learning a good reward function instead of using all of its resources to optimize its behavior with respect to the objective reward function.
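
The nesting described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture or algorithm: the chain environment, the "moving right" bonus feature, the parameter theta, and the random hill-climbing outer loop are all assumptions chosen for brevity. The inner loop is a depth-limited planner that is too shallow to see the goal; the outer loop searches for an internal reward under which that same limited planner earns more objective reward.

```python
import random

N_STATES = 6   # chain states 0..5; reaching state 5 is the goal (assumed toy task)
HORIZON = 30   # steps per evaluation episode
DEPTH = 2      # planning depth: too shallow to see the goal from state 0

def step(state, action):
    """Deterministic chain. action is -1 (left) or +1 (right).
    Returns (next_state, objective_reward)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    if nxt == N_STATES - 1:
        return 0, 1.0   # goal reached: objective reward 1, reset to start
    if state == 0 and action == -1:
        return 0, 0.1   # small distractor reward for bumping the left wall
    return nxt, 0.0

def internal_reward(objective_r, action, theta):
    """Objective reward plus a bonus for moving right (hypothetical feature)."""
    return objective_r + theta * (1.0 if action == 1 else 0.0)

def plan(state, depth, theta):
    """Inner search: depth-limited exhaustive lookahead on the internal reward.
    Returns (value, best_action)."""
    if depth == 0:
        return 0.0, 1
    best_value, best_action = float("-inf"), 1
    for action in (-1, 1):
        nxt, r = step(state, action)
        value = internal_reward(r, action, theta) + plan(nxt, depth - 1, theta)[0]
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action

def objective_return(theta):
    """Run the limited planner guided by theta; score it on objective reward only."""
    state, total = 0, 0.0
    for _ in range(HORIZON):
        _, action = plan(state, DEPTH, theta)
        state, r = step(state, action)
        total += r
    return total

def nested_search(iterations=50, step_size=0.2, seed=0):
    """Outer search: random hill climbing over the reward parameter theta."""
    rng = random.Random(seed)
    theta, best = 0.0, objective_return(0.0)
    for _ in range(iterations):
        candidate = theta + rng.uniform(-step_size, step_size)
        score = objective_return(candidate)
        if score > best:
            theta, best = candidate, score
    return theta, best

if __name__ == "__main__":
    print("objective reward as internal reward:", objective_return(0.0))
    print("after nested search (theta, return):", nested_search())
```

In this toy setting, planning on the objective reward alone keeps the agent at the distractor, while a positive bonus for moving right lets the same shallow planner reach the goal repeatedly. The sketch evaluates each candidate reward with a full episode, which is a simplification; it does not model how a real agent would split a fixed computational budget between the two searches.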

Similar resources

Focused Real-Time Dynamic Programming for MDPs: Squeezing More Out of a Heuristic

Real-time dynamic programming (RTDP) is a heuristic search algorithm for solving MDPs. We present a modified algorithm called Focused RTDP with several improvements. While RTDP maintains only an upper bound on the long-term reward function, FRTDP maintains two-sided bounds and bases the output policy on the lower bound. FRTDP guides search with a new rule for outcome selection, focusing on part...

Using Online Tools to Assess Public Responses to Climate Change Mitigation Policies in Japan

As one of the Annex 1 countries to the Kyoto Protocol of the United Nations Framework Convention on Climate Change, Japan is committed to reducing its greenhouse gas emissions by 6%. In order to achieve this commitment, Japan has undertaken several major mitigation measures, one of which is the domestic measure that includes ecologically friendly lifestyle programs, utilizing natural energ...

OPTIMAL DESIGN OF SINGLE-LAYER BARREL VAULT FRAMES USING IMPROVED MAGNETIC CHARGED SYSTEM SEARCH

The objective of this paper is to present an optimal design for single-layer barrel vault frames via improved magnetic charged system search (IMCSS) and open application programming interface (OAPI). The IMCSS algorithm is utilized as the optimization algorithm and the OAPI is used as an interface tool between analysis software and the programming language. In the proposed algorithm, magnetic c...

Inverse Reinforcement Learning based on Critical State

Inverse reinforcement learning (IRL) tries to find a reward function for a Markov decision process. In IRL, experts produce good traces from which agents learn and adjust the reward function. But the function is difficult to set in some complicated problems. In this paper, Inverse Reinforcement Learning based on Critical State (IRLCS) is proposed to search a succinct and meaningfu...

Voltage Sag Compensation with DVR in Power Distribution System Based on Improved Cuckoo Search Tree-Fuzzy Rule Based Classifier Algorithm

A new technique is presented to improve the performance of a dynamic voltage restorer (DVR) for voltage sag mitigation. This control scheme is based on the cuckoo search algorithm with a tree fuzzy rule based classifier (CSA-TFRC). CSA is used to optimize the output of the TFRC so that the classification output of the network is enhanced. While, the combination of cuckoo search algorithm, fuzzy and decision tree...

Publication date: 2012
